Dual scheduling of work from multiple sources to multiple sinks using source and sink attributes to achieve fairness and processing efficiency

ABSTRACT

A method and apparatus for assigning work, such as data packets, from a plurality of sources, such as data queues in a network processing device, to a plurality of sinks, such as processor threads in the network processing device. In a given processing period, a source is selected in a manner that maintains fairness in the selection process. A corresponding sink is selected for the selected source based on processing efficiency. If, due to assignment constraints, no sink is available for the selected source, the selected source is retained for selection in the next scheduling period, to maintain fairness. In this case, to optimize efficiency, a most efficient currently available sink is identified and a source for providing work to that sink is selected.

The present application is related to Attorney docket No. RPS920090098US1, U.S. patent application Ser. No. ______ entitled Assigning Work from Multiple Sources to Multiple Sinks; and Attorney docket No. RPS920090097US1, U.S. patent application Ser. No. ______ entitled Assignment Constraint Matrix for Assigning Work from Multiple Sources to Multiple Sinks filed on even date herewith and assigned to the assignee of the present application, the details of which are incorporated herein by reference.

BACKGROUND

1. Field

The disclosure relates generally to systems for processing data from multiple sources by multiple processors, such as network processing devices, and more specifically to systems and methods for assigning work in the form of data packets from multiple data queue sources to multiple processing thread sinks using source attributes and sink attributes to achieve both assignment fairness and processing efficiency.

2. Description of the Related Art

Network processing devices, such as routers, switches and intelligent network adapters, are comprised of a network component, which receives incoming data traffic, and a finite set of processing elements that are employed to process the incoming data. Network processing devices routinely partition incoming traffic into different segments for the purpose of providing network segment specific quality of service (QoS). Examples of quality of service parameters are bandwidth limitation enforcement on one particular segment or bandwidth weighting and/or prioritization across all segments. It is commonplace to associate a queue with each segment into which incoming data is divided. Incoming data packets are placed into the queue of their associated segment as they are received.

A queue scheduler is used to determine an order in which the queues are to be served by the device processing elements. For example, the queue scheduler may determine the next queue that is to be served. The next in line data packet, or other work item, from the selected queue is then placed into a single service queue. The processing elements retrieve data packets from the single service queue to provide the required processing for the retrieved data packet. It is commonplace to use polling or interrupts to notify one or more of the processing elements when data packets are available for retrieval from the single service queue for processing.

Increasingly, the processing elements are comprised of multiple compute cores or processing units. Each core may be comprised of multiple hardware threads sharing the resources of the core. Each thread may be independently capable of processing incoming data packets. Using a conventional queue scheduler, only one thread at a time can get data from the single service queue.

Network processing system software increasingly desires to constrain which threads can service which queues in order to create locality of work. A conventional queue scheduler polls the status of all queues to determine the next best suited queue to process without reference to such constraints.

As the number of data queues increases, the time required in order to make a scheduling decision, also known as the scheduling period, also increases. For example, a device that is to support 100 Gbps network traffic comprised of small 64-byte packets needs to support a throughput of roughly 200 million packets per second. On a 2 GHz system, this implies that a scheduling decision needs to be accomplished in less than 10 clock cycles. In conventional queue schedulers, queues are attached to a queue inspection set, often referred to as a ring, when queue status is changed from empty to not-empty. Similarly, queues are detached from the queue inspection set when queue status is changed from not-empty to empty. Use of a queue inspection set limits the number of queues that need to be examined by the queue scheduler during a scheduling period, since the queue scheduler need only examine queues having data to be processed, and these are the not-empty queues attached to the queue inspection set.
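By way of illustration only, the cycle budget quoted above can be reproduced with a few lines of arithmetic. The function name and parameters below are hypothetical, and the sketch assumes that per-packet framing overhead is ignored, which is consistent with the rounded figure of roughly 200 million packets per second.

```python
def cycles_per_decision(link_bps: float, packet_bytes: int, clock_hz: float) -> float:
    """Clock cycles available for one scheduling decision at line rate.

    Assumption of this sketch: every packet is exactly `packet_bytes` long
    and no framing overhead is counted.
    """
    packets_per_second = link_bps / (packet_bytes * 8)
    return clock_hz / packets_per_second


packets_per_second = 100e9 / (64 * 8)            # about 195 million packets/s
budget = cycles_per_decision(100e9, 64, 2e9)     # about 10 cycles per decision
print(f"{packets_per_second:.3e} packets/s, {budget:.2f} cycles per decision")
```

Under these assumptions the budget comes out at about ten cycles, which is the order of magnitude the scheduling period must respect.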

SUMMARY

A method and apparatus for scheduling work from a plurality of sources to a plurality of sinks using source attributes and sink attributes to achieve assignment fairness and processing efficiency is disclosed. In an illustrative embodiment, the plurality of sources are data queues, such as data queues in a network processing device, the work is data packets on the data queues and awaiting processing, and the sinks are processing threads, such as threads on a plurality of processor cores of the networking processing device.

In a given scheduling period, a first source from the plurality of sources is selected based on a source attribute, such as a source attribute that is selected to maintain source selection fairness. Based on any assignment constraints between the plurality of sources and the plurality of sinks, it is determined whether a sink is available for work from the first source. When a sink is available for the first source, a first sink is selected for the first source based on a sink attribute, such as a sink attribute related to processing efficiency.

When a sink is not available for the first source, to maintain fairness, the first source is retained for selection as the first source in the next scheduling period. In this case, to optimize efficiency, a second sink is selected from the plurality of sinks based on sink attributes, such as sink attributes related to processing efficiency. A second source to provide work for processing by the second sink is selected based on any assignment constraints and source attributes, such as source attributes related to source selection fairness.

In accordance with an illustrative embodiment, the first source is selected by a basic scheduler that stays on the first source but defers to a complement scheduler when a sink is not currently available for the first source. The complement scheduler selects the second source and the second sink. The basic and complement schedulers may operate simultaneously and in parallel.

Further objects, features, and advantages will be apparent from the following detailed description and with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 is a functional block diagram of a system incorporating an apparatus and method for assigning work from multiple sources to multiple sinks using a dual scheduler to achieve assignment fairness and processing efficiency in accordance with an illustrative embodiment.

FIG. 2 is a schematic block diagram of a network processing device in which an apparatus and method for assigning work from multiple sources to multiple sinks in accordance with an illustrative embodiment may be implemented.

FIG. 3 is a schematic block diagram of an apparatus for assigning work from multiple sources to multiple sinks in accordance with an illustrative embodiment.

FIG. 4 is a flow chart diagram showing steps of a method for assigning work from multiple sources to multiple sinks using dual scheduling in accordance with an illustrative embodiment.

DETAILED DESCRIPTION

A method and apparatus for scheduling work from multiple sources to multiple sinks using source attributes and sink attributes to achieve assignment fairness and processing efficiency is disclosed. Illustrative embodiments will be described in detail herein with reference to the example of application in network processing devices in which the multiple sources are multiple data queues, the multiple sinks are multiple threads, and the work is in the form of data packets that are to be assigned from the data queues to the threads for processing. It should be understood that other embodiments may be implemented in other applications for matching types of work that are different from those described by example herein from a plurality of sources that are different from those described by example herein to a plurality of sinks that are different from those described by example herein.

The different illustrative embodiments recognize and take into account a number of different considerations. For example, the different illustrative embodiments recognize and take into account that as the number of cores and threads in network processing devices increases, the assignment of work, in the form of data packets, to threads for processing via a single service queue, as employed in conventional queue scheduling, becomes problematic. The scalability of conventional methods for assigning data packets to cores is limited due to contention on the single service queue.

Furthermore, the different illustrative embodiments recognize and take into account that conventional queue scheduling is not adapted to respond effectively to constraints on which threads can service which queues. Such constraints can cause problems that unnecessarily limit system performance in systems where conventional queue scheduling is employed. A conventional queue scheduler polls the status of all queues to determine the next best suited queue to process without reference to such constraints. A constraint imposed by a thread to queue assignment may prevent data from the selected queue from being dispatched to a thread for processing if all of the threads assigned to that queue are busy. At the same time, other threads, that might be servicing other queues, may remain idle waiting for the data from the selected queue to be cleared from the single service queue so that another queue may be selected for processing. This condition violates the fundamental requirement of work conservation. Work conservation is defined as the property of a data processing system that no resources shall be idle while there still is work to be done. In this case, processing cores/threads that could be processing data packets are idle while data packets to be processed remain in the queues.

The different illustrative embodiments recognize and take into account that queue scheduling in a high speed networking environment, such as 10 Gbps, 40 Gbps, or up to 100 Gbps networks, poses a challenge where a high number of queues need to be processed in the limited cycle budget. In addition, the high clock frequency required for the high speed networking environment also limits the number of queues that can be processed in each clock cycle. In a queue scheduler all queues need to be inspected for data to be processed. Electrical propagation delays associated with this inspection put a limit on the number of queues that can be inspected.

The different illustrative embodiments recognize and take into account that conventional methods for dealing with increases in the number of data queues to be serviced by processing threads by using queue inspection sets to reduce the time required to make a scheduling decision cannot be applied effectively to the case where there are constraints on which threads can service which queues. Application of queue inspection sets in the context of such constraints would imply associating a queue inspection set or inspection ring with each thread or other sink. However, this approach becomes infeasible due to the fact that multiple attach/detach operations, one for each thread or other sink that is eligible for a particular queue, at each queue status change would not be able to be accomplished in the time constraints set by the scheduling interval.

The different illustrative embodiments recognize and take into account that it is also desirable to provide a form of load balancing of packet processing over cores in a multiple core system. Load balancing preferably is implemented such that roughly the same number of threads are active on each core. Conventional queue scheduling does not support such load balancing.

The different illustrative embodiments also recognize and take into account that when a cluster of dedicated source queues is handled by a cluster of dedicated sink threads, without any overlap of the different types of data traffic between queues and threads, it is possible to schedule the source queues with fairness by placing the queues in each cluster on a ring served by the corresponding cluster of threads, or by other simple methods of providing fairness of scheduling among the different traffic types. However, when any combination of affinity between source queues and sink threads is possible, that is, when multiple data types from multiple sources may be processed by multiple overlapping sets of sinks, such fairness is much more difficult to provide. Furthermore, in such a case, where a given source queue can be served by multiple different sink threads, given a source queue selected to be processed in a given scheduling period, a sink thread to process the work from the selected queue must also be selected. The thread may be selected from among those threads with the best processing capability, such as threads belonging to less loaded cores and/or more efficient cores, such as cores having larger caches, etc. But such a selection can completely undermine fairness at the system level, leading to some data traffic on some source queues never being served.

The illustrative embodiments disclose a method and apparatus that provides a solution for scheduling work from multiple sources to multiple sinks with optimum efficiency at the system level, e.g., by giving priority to less loaded sinks at a given time, while solving the problem of unfairness that may arise when such priority is given to less loaded processing sinks. The illustrative embodiments provide a scheduling solution that can be implemented for very high packet traffic throughput, e.g., 100G Ethernet, with any type of affinity between traffic queues and threads.

As illustrated in FIG. 1, an apparatus or method in accordance with an illustrative embodiment may find application in any data processing environment 100 in which work 102 from multiple sources 104 is to be directed to multiple sinks 106 for processing. In a particular illustrative embodiment, an apparatus or method in accordance with an illustrative embodiment is adapted for use in data processing environment 100 such as network processing device 108. Network processing device 108 may be any known type of network processing device, such as a router, network switch, and/or an intelligent network adapter.

In accordance with an illustrative embodiment, sources 104 may include data queues 110. In this case, as well as in other illustrative embodiments, work 102 may include data packets 112, such as data packets 112 on queues 110. Data packets 112 comprise, for example, data traffic flowing through network processing device 108.

In accordance with an illustrative embodiment, sinks 106 may include a plurality of processor threads 114. For example, multiple threads 114 may be provided on multiple processor cores 116. Each of the plurality of cores 116 may provide one or more threads 114. During any particular scheduling or selection period, one or more sinks 106, such as one or more threads 114, may be available 118. Sink 106 generally is available 118 if sink 106 is not busy processing, and thus is available to receive and process work 102 from source 104.

In accordance with an illustrative embodiment, sources 104 and sinks 106 are subject to one or more assignment constraints 120. Assignment constraints 120 define which sinks 106 may process work 102 from which sources 104. Thus, assignment constraints 120 may also be said to define which sources 104 may provide work 102 to which sinks 106.

In accordance with an illustrative embodiment, assignment constraints 120 may be implemented in qualifier matrix 122. Qualifier matrix 122 implements assignment constraints 120 such that, by providing sinks 106 to qualifier matrix 122, the sources 104 that are qualified to provide work 102 to such sinks 106 are identified. Similarly, sinks 106 that are qualified to receive work 102 from sources 104 are identified by providing such sources 104 to qualifier matrix 122.

In accordance with an illustrative embodiment, source scheduler 126 selects a single selected source 128 each scheduling period. Selected source 128 is the source 104 from which work 102 will be directed to a selected sink 106 for processing in a given scheduling period. Thus, source scheduler 126 may be coupled to qualifier matrix 122 to receive qualified source information and qualified sink information from qualifier matrix 122. Source scheduler 126 selects selected source 128 from among those sources 104 that have work 102 in the current scheduling period. Thus, selected source 128 is a source 104 from which work 102 may be assigned to an available 118 sink 106 in the current scheduling period.

In accordance with an illustrative embodiment, source scheduler 126 may include a hierarchical structure 130 that can process a high number of sources 104 in parallel each clock cycle by using a modular design with each module processing a subset of sources 104 to be processed. For example, hierarchical scheduler 130 may include a plurality of first level scheduler modules 132. Each first level scheduler module 132 operates preferably simultaneously in parallel with other first level scheduler modules 132 to select an intermediate selected source from a subset of sources 104. Preferably the various subsets of sources 104 processed by each first level module 132 do not overlap. The intermediate selected sources from first level modules 132 are provided to second level module 134. Second level module 134 selects a selected source 104 from the intermediate selected sources. In accordance with an illustrative embodiment, first level modules 132 and/or second level module 134 may implement their respective selections using a round robin selection process and/or structure. Thus, source scheduler 126 in accordance with the illustrative embodiment solves the problem of processing a large number of sources 104 for scheduling in one clock cycle. As parallelism is achieved with the modular design, hierarchical source scheduler 130 in accordance with the illustrative embodiment is capable of deriving a scheduling decision within the allotted scheduling period to meet high speed networking performance requirements.
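By way of a non-limiting sketch, a two-level round robin selection of the kind described above might be modeled in software as follows. The class names `RoundRobin` and `HierarchicalScheduler`, the fixed module size, and the peek/commit split are assumptions made for this illustration. In hardware the first level modules would evaluate in parallel; the loop below evaluates them one after another.

```python
from typing import List, Optional


class RoundRobin:
    """Round robin picker over a fixed number of slots (hypothetical helper)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.start = 0  # slot examined first on the next pick

    def peek(self, eligible: List[bool]) -> Optional[int]:
        # Return the next eligible slot without advancing the pointer.
        for offset in range(self.size):
            slot = (self.start + offset) % self.size
            if eligible[slot]:
                return slot
        return None

    def commit(self, slot: int) -> None:
        # Advance past the slot that was actually served.
        self.start = (slot + 1) % self.size


class HierarchicalScheduler:
    """First level modules cover disjoint subsets; a second level picks among them."""

    def __init__(self, num_sources: int, module_size: int) -> None:
        self.module_size = module_size
        num_modules = (num_sources + module_size - 1) // module_size
        self.first_level = [RoundRobin(module_size) for _ in range(num_modules)]
        self.second_level = RoundRobin(num_modules)

    def select(self, eligible: List[bool]) -> Optional[int]:
        candidates = []
        for m, module in enumerate(self.first_level):
            lo = m * self.module_size
            subset = eligible[lo:lo + self.module_size]
            subset += [False] * (self.module_size - len(subset))  # pad last module
            candidates.append(module.peek(subset))
        winner = self.second_level.peek([c is not None for c in candidates])
        if winner is None:
            return None
        # Only the module whose candidate is served advances its pointer.
        self.second_level.commit(winner)
        self.first_level[winner].commit(candidates[winner])
        return winner * self.module_size + candidates[winner]
```

With eight always-eligible sources split into two modules of four, successive calls to `select` alternate between the modules while each module rotates through its own subset.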

In accordance with an illustrative embodiment, source scheduler 126 may implement a multi-priority scheduler 136. Multi-priority scheduler 136 allows selected source 128 to be selected from sources 104 having higher priority before being selected from sources 104 having lower priority. In accordance with an illustrative embodiment, multi-priority scheduler 136 may include a plurality of prioritized scheduler slices 138. Each scheduler slice 138 selects an intermediate selected source from a subset of sources 104. Each subset of sources 104 processed by a scheduler slice 138 has a different priority level from the priority level of the subsets processed by other scheduler slices 138. Intermediate selected sources from prioritized scheduler slices 138 are provided to selector 140. Selector 140 selects as selected source 128 the intermediate selected source from prioritized scheduler slice 138 processing the subset of sources 104 having the highest priority level.
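The slice-and-selector arrangement described above can be sketched, purely as an illustration, in the following form. The class name `MultiPriorityScheduler`, the convention that slice 0 is the highest priority, and the use of round robin inside each slice are assumptions of this sketch rather than features stated in the text.

```python
from typing import List, Optional


class MultiPriorityScheduler:
    """One round robin slice per priority level plus a highest-priority selector."""

    def __init__(self, slices: List[List[int]]) -> None:
        # slices[0] lists the source ids of the highest priority level
        # (an assumed convention for this sketch).
        self.slices = slices
        self.cursors = [0] * len(slices)

    def _slice_pick(self, level: int, has_work: List[bool]) -> Optional[int]:
        # Intermediate selection: position of the next source with work in this slice.
        members = self.slices[level]
        for offset in range(len(members)):
            pos = (self.cursors[level] + offset) % len(members)
            if has_work[members[pos]]:
                return pos
        return None

    def select(self, has_work: List[bool]) -> Optional[int]:
        # Every slice produces an intermediate selection.
        picks = [self._slice_pick(level, has_work) for level in range(len(self.slices))]
        # The selector takes the pick of the highest priority slice that has one.
        for level, pos in enumerate(picks):
            if pos is not None:
                self.cursors[level] = (pos + 1) % len(self.slices[level])
                return self.slices[level][pos]
        return None
```

A lower priority slice is served only in periods in which no higher priority slice offers a candidate, so its cursor is left untouched while higher priority work is pending.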

In accordance with an illustrative embodiment, source scheduler 126 may implement dual scheduler 142. Dual scheduler 142 uses source attributes and sink attributes to assign work 102 from sources 104 to sinks 106 in a manner that achieves both assignment fairness and processing efficiency. Dual scheduler 142 includes first or basic scheduler 144 and second or complement scheduler 146.

Each scheduling period, basic scheduler 144 selects a source 104 having work to be performed based on source attributes. For example, basic scheduler 144 may select a source 104 in a manner so as to preserve assignment fairness among sources 104, such as in a conventional round robin fashion. If at least one sink 106 is available to process work 102 from the source 104 selected by basic scheduler 144, the source selected by basic scheduler 144 becomes the selected source 128, and an available sink is then selected to process work 102 from selected source 128. If no sink 106 is available to process work from the source 104 selected by basic scheduler 144, due to assignment constraints 120, basic scheduler 144 does not make another selection, but maintains its current selection for selection again during the next scheduling period. Therefore, the source 104 selected by basic scheduler 144 based on source attributes is not bypassed, and work 102 from this source 104 will be assigned to a sink 106 at the next scheduling period for which a sink 106 that is allowed by assignment constraints 120 to process work 102 from that source 104 becomes available. Thus, basic scheduler 144 provides for scheduling fairness, ensuring that all sources 104 are serviced in an order as defined by source attributes.

Complement scheduler 146 selects a source 104 for which work 102 is to be processed in any scheduling period for which the source 104 selected by basic scheduler 144 cannot be processed by an available 118 sink 106. Complement scheduler 146 identifies one or more sinks 106 that may most efficiently process work 102 in the current scheduling period. This selection is based on one or more sink attributes related to efficiency. Based on the selected sinks, complement scheduler 146 selects the selected source 128 as a source 104 that has work 102 available in the current scheduling period and that may be processed by one of the selected sinks 106, in accordance with given assignment constraints 120. Complement scheduler 146 may employ one or more source attributes in making a selection from among multiple potential sources 104 that may provide work 102 to the sinks 106 selected by complement scheduler 146, such as to maintain fairness among such potential sources 104. Thus, complement scheduler 146 enhances efficiency by ensuring that work 102 from a source 104 is provided for processing by the most efficient sink 106 in each processing period for which basic scheduler 144 cannot provide the source 104 from which work 102 is to be processed in order to maintain assignment fairness.
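The interplay of the basic and complement schedulers described above may be easier to follow in a small software model. The sketch below is illustrative only: the class name `DualScheduler`, the use of round robin as the fairness attribute, and the use of a per-sink load count as the efficiency attribute are all assumptions, and the two schedulers run one after the other here rather than in parallel.

```python
from typing import Dict, List, Optional, Set, Tuple


class DualScheduler:
    def __init__(self, allowed: List[Set[int]]) -> None:
        self.allowed = allowed      # allowed[s] = sinks permitted to serve source s
        self.n = len(allowed)
        self.basic_cursor = 0       # round robin position of the basic scheduler
        self.comp_cursor = 0        # separate position for the complement scheduler

    def _basic_pick(self, has_work: List[bool]) -> Optional[int]:
        for offset in range(self.n):
            s = (self.basic_cursor + offset) % self.n
            if has_work[s]:
                return s
        return None

    def schedule(
        self, has_work: List[bool], available: Set[int], load: Dict[int, int]
    ) -> Optional[Tuple[int, int]]:
        """Return (source, sink) for this scheduling period, or None."""
        first = self._basic_pick(has_work)
        if first is None:
            return None
        self.basic_cursor = first   # stay on this source until it is served
        sinks = self.allowed[first] & available
        if sinks:
            # Basic path: the fair source is served by its least loaded sink.
            sink = min(sinks, key=lambda k: (load[k], k))
            self.basic_cursor = (first + 1) % self.n
            return first, sink
        # Complement path: start from the most efficient available sink and
        # pick, fairly, a source that is allowed to use it.
        for sink in sorted(available, key=lambda k: (load[k], k)):
            for offset in range(self.n):
                s = (self.comp_cursor + offset) % self.n
                if has_work[s] and sink in self.allowed[s]:
                    self.comp_cursor = (s + 1) % self.n
                    return s, sink
        return None
```

In this model a source that the basic scheduler could not serve is never skipped: the basic cursor only advances once that source has actually been dispatched, while the complement path keeps otherwise idle sinks busy in the meantime.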

In accordance with an illustrative embodiment, sink scheduler 148 selects available 118 sink 106 that is qualified to receive work 102 from selected source 128. Sink scheduler 148 preferably is coupled to source scheduler 126 to receive selected source 128 and to qualifier matrix 122 to receive available 118 sinks 106 qualified to receive work 102 from selected source 128. Any desired method or structure may be used to select a qualified available 118 sink 106 from multiple qualified and currently available 118 sinks 106 for selected source 128.

In accordance with an illustrative embodiment, where sinks 106 include multiple threads 114 on multiple cores 116, sink scheduler 148 may include core scheduler 150 and thread scheduler 152. Core scheduler 150 selects a core 116 containing an available thread 114 that is qualified to receive work 102 from selected source 128. Core scheduler 150 preferably selects a core 116 based on a workload of the core 116 or another attribute indicative of core efficiency. For example, core scheduler 150 may select from among cores 116 containing available threads 114 that are qualified to receive work 102 from selected source 128 that core 116 having a smallest number or percentage of active threads 114 or a largest number or percentage of available threads 114. Thread scheduler 152 then selects a single qualified available thread 114 on the core 116 selected by core scheduler 150 using any desired method or structure.
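As a hedged illustration of the two-step sink selection described above, the following sketch first picks the core with the fewest active threads and then a thread on that core. The function name `select_sink`, the dictionary layout, and the tie-breaking rules (lowest core id, first listed thread) are assumptions of this example; the text leaves the thread selection method open.

```python
from typing import Dict, List, Optional, Set, Tuple


def select_sink(
    core_threads: Dict[int, List[int]],
    busy: Set[int],
    qualified_available: Set[int],
) -> Optional[Tuple[int, int]]:
    """Return (core, thread) for the selected source, or None if no sink qualifies.

    core_threads maps a core id to its thread ids, busy holds the active
    threads, and qualified_available holds the threads that are both
    available and allowed to serve the selected source.
    """
    best = None
    for core, threads in core_threads.items():
        candidates = [t for t in threads if t in qualified_available]
        if not candidates:
            continue  # this core cannot serve the selected source right now
        active = sum(1 for t in threads if t in busy)
        key = (active, core)  # fewest active threads first, then lowest core id
        if best is None or key < best[0]:
            best = (key, core, candidates[0])
    if best is None:
        return None
    return best[1], best[2]


cores = {0: [0, 1], 1: [2, 3]}
print(select_sink(cores, busy={0}, qualified_available={1, 2, 3}))
```

In the usage line, core 1 has no active threads while core 0 has one, so a thread on core 1 is chosen even though core 0 also offers a qualified thread.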

In accordance with an illustrative embodiment, packet injector 154 is provided to deliver work 102 from selected source 128 to the available 118 sink 106 selected by sink scheduler 148.

The illustration of FIG. 1 is not meant to imply physical or architectural limitations to the manner in which different advantageous embodiments may be implemented. Other components in addition to and/or in place of the ones illustrated may be used. Some components may be unnecessary in some advantageous embodiments. Also, the blocks are presented to illustrate some functional components. One or more of these blocks may be combined and/or divided into different blocks when implemented in different advantageous embodiments.

For example, as will be discussed in more detail below, source scheduler 126 may include hierarchical 130, multi-priority 136, and/or dual 142 scheduler functions in one or more various combinations. For example, each prioritized scheduler slice 138 of a multi-priority scheduler 136 may be implemented as a hierarchical scheduler 130 having multiple first level scheduler modules 132 and second level scheduler module 134 or as a dual scheduler 142 having basic scheduler 144 and complement scheduler 146 for each priority level. As another example, dual scheduler 142 may employ a hierarchical scheduler 130 to implement basic scheduler 144 and/or complement scheduler 146, in whole or in part.

The block diagram of FIG. 2 shows network processing device 200 in which an apparatus and method for assigning work from multiple sources to multiple sinks in accordance with an illustrative embodiment may be implemented. In this example, network processing device 200 is an example of one implementation of network processing device 108 of FIG. 1. Network processing device 200 represents one example of an environment in which an apparatus and/or method in accordance with an illustrative embodiment may be implemented.

Network processing device 200 includes network component 202 and processing component 204. Processor bus 206 connects network component 202 to processing component 204. Processor bus 206 also provides interface 208 to other data processing units, such as to processing units on other chips where network processing device 200 is implemented as a multiple chip system.

Network component 202 sends and receives data packets via high speed network interfaces 210. Received packets are processed initially by packet pre-classifier 212. For example, packet pre-classifier 212 may partition incoming traffic into different segments for the purpose of providing network segment specific quality of service (QoS) or for some other purpose as may be defined by a user via host interface 214. Data packets sorted by packet pre-classifier 212 are directed to ingress packet queues 216. For example, one or more queues 216 may be associated with each segment into which incoming data is divided by packet pre-classifier 212.

Processing component 204 may include a plurality of processor cores 218, 220, 222, and 224. Although in the example embodiment illustrated processing component 204 includes four cores 218, 220, 222, and 224, it should be understood that network processing device 200 in accordance with an illustrative embodiment may include more or fewer cores implemented on one or more processor chips. Each of cores 218, 220, 222, and 224 may support one or more processing threads 226, 228, 230, and 232, respectively. In accordance with an illustrative embodiment, each of cores 218, 220, 222, and 224 preferably may contain any number of threads 226, 228, 230, and 232 as may be required or desired for a particular implementation.

Data packets in queues 216 are sent to threads 226, 228, 230, and 232 for processing via processor bus 206. Queues 216 are examples of sources of work. The data packets in queues 216 are examples of work to be processed. Threads 226, 228, 230, and 232 are examples of sinks for the work. In accordance with an illustrative embodiment, data packets from queues 216 are assigned to threads 226, 228, 230, and 232 for processing by scheduler 234. As will be discussed in more detail below, scheduler 234 in accordance with an illustrative embodiment includes qualifier matrix 236, source scheduler 238, and sink scheduler 240. These components provide an apparatus and method for effectively assigning packets from multiple queues 216 to multiple threads 226, 228, 230, and 232 given assignment constraints on which threads 226, 228, 230, and 232 may process work from which queues 216.

The block diagram of FIG. 3 shows scheduler apparatus 300 for assigning work from multiple sources 302 to multiple sinks 304 in accordance with an illustrative embodiment. Apparatus 300 includes qualifier matrix 306, source scheduler 308, and sink scheduler 310. In this example, qualifier matrix 306 is an example of one implementation of qualifier matrix 122 of FIG. 1 and of qualifier matrix 236 of FIG. 2. Source scheduler 308 is an example of one implementation of source scheduler 126 of FIG. 1 and of source scheduler 238 of FIG. 2. Sink scheduler 310 is an example of sink scheduler 148 of FIG. 1 and of sink scheduler 240 of FIG. 2.

The assignment of work from sources 302 to sinks 304 is subject to a set of assignment constraints 312. Each source 302, for example, a data queue 314, is associated with a set of sinks 304, for example, working threads 316, that are allowed to work on work from said source 302. When a particular sink 304 is not busy it declares itself available and ready to process new work, such as a new data packet. This logically makes all sources 302 that contain the available sink 304 in their worker set eligible in the next scheduling period to be selected to provide work to sink 304. Qualifier matrix 306 captures this eligibility relationship and hence maps the set of ready or available sinks 304 to a set of qualified sources which is presented to source scheduler 308. Source scheduler 308 selects from the overlap of all qualified and non-empty sources 302 the next source 302 to provide the next work in accordance with an internal source scheduler algorithm. Once a source 302 is selected, sink scheduler 310 determines the most appropriate sink 304 to execute the work. Where the sink 304 is a thread 316 executing on a core 318, sink scheduler 310 may first determine the most appropriate core 318 to execute the work based on the workload of the core 318. Sink scheduler 310 then selects the next thread 316 on that selected core 318 to receive the work. Finally, the next work from the source 302 selected by source scheduler 308 is sent to the sink 304 selected by sink scheduler 310 by, for example, packet injector 320. The selected sink 304 is declared busy and the next scheduling cycle commences.
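The scheduling cycle described above can be summarized, as a rough sketch only, in a single function. The name `scheduling_cycle`, the integer bit masks, the round robin source choice, and the lowest-numbered-sink choice are assumptions standing in for the internal source scheduler algorithm and the sink scheduler, neither of which is specified at this level of detail.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


def scheduling_cycle(
    queues: List[Deque[str]],
    assignment_mask: List[int],
    ready_mask: int,
    rr_start: int = 0,
) -> Tuple[Optional[Tuple[int, int, str]], int]:
    """Run one cycle; return ((source, sink, packet) or None, updated ready mask).

    assignment_mask[s] has bit k set when sink k may serve source s, and
    ready_mask has bit k set when sink k has declared itself ready.
    """
    n = len(queues)
    for offset in range(n):
        source = (rr_start + offset) % n
        eligible_sinks = assignment_mask[source] & ready_mask
        # A source is selectable when it is non-empty and qualified by a ready sink.
        if queues[source] and eligible_sinks:
            # Placeholder sink choice: lowest-numbered eligible sink.
            sink = (eligible_sinks & -eligible_sinks).bit_length() - 1
            packet = queues[source].popleft()              # inject the next work item
            return (source, sink, packet), ready_mask & ~(1 << sink)  # sink now busy
    return None, ready_mask


queues = [deque(["a"]), deque(), deque(["c"])]
print(scheduling_cycle(queues, [0b01, 0b10, 0b10], ready_mask=0b10))
```

In the usage line only sink 1 is ready; source 0 is not allowed to use it and source 1 is empty, so the packet from source 2 is dispatched and sink 1 is marked busy.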

Scheduler 300 supports a finite set of sinks 304. In this example it is assumed that sinks 304 are processing elements of an apparatus comprised of a plurality of cores 318. Each core 318 is comprised of a set of threads 316. Each thread 316 shares underlying core resources with other threads 316 of the same core 318. As a result of the sharing of processor resources, such as pipeline, cache, translation lookaside buffer (TLB), etc., among threads 316 of a single core 318, it is desirable to dispatch work to the core 318 that is least loaded with running threads 316. Threads 316 that are idle consume fewer resources, for example, in the processor pipeline, than threads 316 that are active. So the number of running threads 316 in core 318 is an indication of how busy that core 318 is.

Scheduler 300 also supports a finite set of sources 302. In this example sources 302 are data queues 314. Associated with each source 302 are assignment constraints defined by source-sink assignment mask 312. Source-sink assignment mask 312 indicates which sinks 304 are in general allowed to handle work from which sources 302. For example, source-sink assignment mask 312 may be implemented such that a bit vector is provided for each supported source 302 with a bit of the bit vector provided for each supported sink 304. A bit of the bit vector may be set if a particular sink 304 is in general allowed to handle work from a particular source 302. In accordance with an illustrative embodiment, the source-sink assignment constraints defined by source-sink assignment mask 312 may be set or changed at any time. In most cases, however, the assignment constraints defined by source-sink assignment mask 312 are defined at a configuration and setup time of scheduler apparatus 300.

The assignment constraints defined by source-sink assignment mask 312 are implemented in qualifier matrix 306. Qualifier matrix 306 is essentially a copy of source-sink assignment mask 312. Qualifier matrix 306 is a two dimensional matrix having a row (or column) for each supported source 302 and a column (or row) for each supported sink 304. Thus, in accordance with an illustrative embodiment, qualifier matrix 306 may be used to determine which sources 302 are qualified to send work to a given sink 304 and which sinks 304 are qualified to receive work from a given source 302.

In an illustrative embodiment, qualifier matrix 306 may be implemented using multiple qualifier sub-matrixes as disclosed in U.S. Patent Application entitled Assignment Constraint Matrix for Assigning Work from Multiple Sources to Multiple Sinks filed on even date herewith and assigned to the assignee of the present application, the details of which are incorporated herein by reference.

When a sink 304 is ready for work it announces its “readiness” or availability. Notification of sink availability may be achieved by providing sink ready mask 322 having a “ready” bit corresponding to each supported sink 304. When a sink 304 is available and ready for work, the corresponding “ready” bit in the sink ready mask 322 is set. Where sink 304 is a thread 316 on a core 318, one way of setting such a bit is through memory-mapped input/output (MMIO) operations. The ready thread 316 may then optionally go to sleep, for example, through memory wait operations, to reduce its footprint on core 318 resources.

Optionally, one or more various system constraints 324 also may affect which sinks 304 are available to perform work in any given scheduling period. For example, system constraints 324 may dictate that certain sinks 304 are declared never to participate in a scheduling decision. System constraints 324 may be implemented in system constraints mask 326.

All sinks 304 that are available for work in the current scheduling period, that is, all sinks to which work can be dispatched, may be determined based on sinks 304 that have indicated that they are ready for work in the sink ready mask 322 and any other system constraints 324 that may affect sink 304 availability as defined by system constraints mask 326. The resulting set of available sinks 304 may be referred to as sink pressure 328. Sink pressure 328 may be provided in the form of a sink pressure bit vector having a bit corresponding to each supported sink 304, wherein a sink pressure bit for a particular sink 304 is set if that sink 304 is determined to be available for work.
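One way to picture the derivation of sink pressure 328 is as a bitwise combination of sink ready mask 322 and system constraints mask 326. The sketch below assumes, purely for illustration, that a set bit in the constraints mask means the sink is permitted to participate; the embodiment does not fix that polarity, and the function name is hypothetical.

```python
def sink_pressure(ready_mask, constraints_mask):
    """Return the bit vector of sinks to which work can be dispatched.

    Assumption: a set bit in `constraints_mask` means the sink may
    participate in scheduling; a cleared bit excludes it.
    """
    return ready_mask & constraints_mask


# Threads 1, 4 and 6 have declared themselves ready.
ready = 0b01010010
# System constraints exclude thread 6 from every scheduling decision.
permitted = 0b10111111

pressure = sink_pressure(ready, permitted)  # threads 1 and 4 remain
```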

All supported sources 302 that have work available to be performed are determined. The resulting set of sources 302 that have work to be performed may be referred to as the source pressure 330. Source pressure 330 may be provided in the form of a bit vector having a bit corresponding to each supported source 302, wherein a source pressure bit for a particular source 302 is set if the source 302 is determined to have work available to be performed. As a result, a source 302 changing status from or to empty or not-empty requires that only a single bit value be switched.
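The relationship between source pressure 330, the qualified sources produced by qualifier matrix 306, and the set from which source scheduler 308 chooses can be sketched as follows. The function names and the example masks are assumptions for illustration only.

```python
def source_pressure(queue_lengths):
    """Bit i is set if source i has work waiting (its queue is non-empty)."""
    bits = 0
    for src, length in enumerate(queue_lengths):
        if length > 0:
            bits |= 1 << src
    return bits


def qualified_sources(assignment_mask, sink_pressure_bits):
    """Bit i is set if at least one available sink may serve source i."""
    bits = 0
    for src, allowed_sinks in enumerate(assignment_mask):
        if allowed_sinks & sink_pressure_bits:
            bits |= 1 << src
    return bits


# Hypothetical example: three queues, eight threads, only thread 4 available.
example_mask = [0b00001111, 0b11110000, 0b00011000]
non_empty = source_pressure([2, 0, 5])                # queues 0 and 2 hold packets
qualified = qualified_sources(example_mask, 0b00010000)
eligible = non_empty & qualified                      # overlap seen by the source scheduler
```

Here only queue 2 is both non-empty and served by an available thread, so it is the only candidate in this period.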

The next source 302 for which work is to be performed in a given scheduling period is selected by source scheduler 308 by selecting one source 302 from among those that have work to be performed as indicated by source pressure bit vector 330. Source scheduler 308 may make this selection based on any scheduling method or algorithm for selecting the most appropriate source 302 from among the eligible sources 302. For example, in accordance with an illustrative embodiment, source scheduler 308 may make this selection using a dual scheduler as disclosed herein to maintain fairness and to improve efficiency. The selected source 302 may be indicated in selected source bit vector 332 having a bit for each supported source 302 and wherein one selected source bit corresponding to the selected source 302 is set.

A sink 304 to which work from the selected source 302 is to be assigned is selected. In accordance with an illustrative embodiment, where sinks 304 include multiple threads 316 on multiple cores 318, sink 304 selection preferably includes first determining a core 318 to which the work from the selected source 302 is to be dispatched by sink scheduler 310 based on the selected source 302, as indicated in selected source bit vector 332, and thread pressure 328 (that is, sink pressure 328 where the sinks are threads 316) indicating available threads 316. For example, to determine a core 318 to which work is to be dispatched, the eligible threads 316 to which work from the selected source 302 may be directed may be determined, using qualifier matrix 306, by an AND operation of corresponding bits of source-sink assignment mask 312, indicating threads 316 allowed to perform work for the selected source 302, and thread pressure 328. The result may be provided as a thread-schedulable mask in the form of a bit vector having a bit corresponding to each supported thread 316, wherein a thread-schedulable bit for any particular supported thread 316 is set if it is determined that work may be dispatched from the selected source 302 to that thread 316. The determined eligible threads 316 may be used to determine eligible cores 318 by multiplexing the thread-schedulable mask into a core bit of a core eligibility mask. A core 318 is eligible if any of the eligible threads 316 belong to that core 318. The core eligibility mask includes a bit vector having a bit corresponding to each of the system cores 318, wherein a core eligible bit corresponding to a particular core 318 is set if the bit for any of its threads 316 is set in the thread-schedulable mask. One of the eligible cores 318 is selected to receive work from the selected source 302, preferably based on core workload considerations. For example, an eligible core 318 that has the most idle, or largest proportion of idle, threads 316 may be selected. Alternatively, some other workload based or other criteria may be used to select a core 318 from among determined eligible cores 318.
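The core selection described above can be sketched in software as follows. This is a minimal sketch under stated assumptions: threads are numbered so that each core owns a contiguous group of `THREADS_PER_CORE` bits, the number of idle threads on a core is approximated by its set bits in the thread pressure vector, and ties are broken toward the lowest-numbered core. All names and sizes are hypothetical, and the embodiment contemplates logic circuitry, not software, for this step.

```python
THREADS_PER_CORE = 4  # hypothetical layout: thread t lives on core t // 4
NUM_CORES = 2
CORE_BITS = (1 << THREADS_PER_CORE) - 1


def thread_schedulable(source_mask, thread_pressure):
    """AND of the selected source's allowed threads with the available threads."""
    return source_mask & thread_pressure


def core_eligibility(schedulable):
    """Set core bit c if any thread of core c is set in the schedulable mask."""
    cores = 0
    for core in range(NUM_CORES):
        if (schedulable >> (core * THREADS_PER_CORE)) & CORE_BITS:
            cores |= 1 << core
    return cores


def idle_threads(core, thread_pressure):
    """Count available threads on a core (used here as its idle-thread count)."""
    return bin((thread_pressure >> (core * THREADS_PER_CORE)) & CORE_BITS).count("1")


def select_core(source_mask, thread_pressure):
    """Pick the eligible core with the most idle threads, or None if none is eligible."""
    eligible = core_eligibility(thread_schedulable(source_mask, thread_pressure))
    candidates = [c for c in range(NUM_CORES) if (eligible >> c) & 1]
    if not candidates:
        return None
    return max(candidates, key=lambda c: idle_threads(c, thread_pressure))


# The selected queue may use threads 3 and 4; threads 3, 4, 5 and 6 are available.
chosen_core = select_core(0b00011000, 0b01111000)
```

Both cores are eligible in this example, and core 1 is chosen because three of its threads are idle against one on core 0.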

Having selected a core 318, one of the available threads 316 on the core 318 that is allowed to perform work for the selected source 302 is selected to receive work from the selected source 302 by sink scheduler 310 by selecting a thread 316 from the selected core 318 for which the thread-schedulable bit in the thread-schedulable mask is set. Any desired criteria and method may be used to select from among selectable threads, such as using a round robin selection process.
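A round robin thread selection within the selected core might look like the following sketch. It assumes that the scheduler remembers, per core, the local index of the thread it selected last; the function name, the parameter names, and the contiguous thread numbering are assumptions for illustration.

```python
def select_thread_rr(schedulable, core, last_local, threads_per_core=4):
    """Return the next schedulable thread on `core` after `last_local`.

    `schedulable` is the thread-schedulable bit vector, `last_local` is the
    index within the core (0..threads_per_core-1) of the previously selected
    thread. Returns a global thread number, or None if no thread qualifies.
    """
    base = core * threads_per_core
    for offset in range(1, threads_per_core + 1):
        local = (last_local + offset) % threads_per_core
        if (schedulable >> (base + local)) & 1:
            return base + local
    return None


# Threads 4 and 6 (local indexes 0 and 2 on core 1) are schedulable.
after_local_0 = select_thread_rr(0b01010000, core=1, last_local=0)
after_local_2 = select_thread_rr(0b01010000, core=1, last_local=2)
```

Starting the search one position past the last selection is what rotates the choice among the threads of a core over successive scheduling periods.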

Work is retrieved from the selected source 302 and dispatched by packet injector 320 to the thread 316 that has been selected as the sink 304 to work on it. Packet injector 320 may, for example, notify the selected thread 316 that it has been selected to receive a work packet. This notification may be provided via a memory touch, if the thread 316 was waiting on a memory location. The selected thread 316 may then be marked as busy. This may be accomplished by clearing the ready bit for this thread in sink ready mask 322.

Qualifier matrix 306 and sink scheduler 310 may be implemented with multiplexers and bit masking to operate in one cycle. Source scheduler 308 may require a more complex implementation.

The flow chart diagram of FIG. 4 shows steps of method 400 for making a scheduling decision in accordance with an illustrative embodiment. Method 400 may be implemented by scheduler 126 of FIG. 1, and more particularly by dual scheduler 142 of FIG. 1. Method 400 may be implemented in the scheduler apparatus 300 of FIG. 3, or in a similar or entirely different apparatus.

For each scheduling period, method 400 selects a source for which work is to be processed by using basic scheduler process 402. Basic scheduler process 402 selects a source for which work is to be processed based on source attributes (step 404). In other words, step 404 includes selecting a source based solely on attributes related to the source, such as the source having work available to be performed. For example, to maintain fairness among sources, step 404 may include selecting a source from among sources having work to be performed in a conventional round robin fashion. Other source attributes that may be considered in step 404 to maintain fairness among sources may include static attributes, such as tenure related to source priority or other factors, and dynamic attributes, such as the amount of time that a source has been waiting to be serviced and/or the amount of work that a source has to be processed. Step 404 does not take into account attributes of the system sinks, such as whether or not a sink is available to process work from any given source.

It is then determined whether a sink is available to process work from the selected source (step 406). Step 406 includes determining which sinks may process work from the selected source given any assignment constraints and whether any of these sinks are currently available. If at least one sink is determined in step 406 to be available to process work from the selected source, one of the available sinks is selected to receive the work from the selected source in the current scheduling period (step 408). Step 408 may include selecting one sink from among the available sinks that may process the work from the selected source based on sink attributes related to efficiency. For example, where the sinks are threads operating on cores, step 408 may include choosing a thread based on static thread or core attributes, such as the size of the cache available to a thread or core, and/or based on dynamic attributes, such as the number of currently busy threads on a core, actual cache occupancy, etc. Fairness is guaranteed by basic scheduler process 402 by selecting a source based on source attributes related to fairness alone. This fairness is not modified by the subsequent selection of a sink in step 408 that takes into account efficiency considerations, but efficiency is increased if the sink selection is based on such efficiency considerations. Work from the selected source may then be dispatched to the selected sink (step 410). Step 410 completes method 400 for the current scheduling period.

If the source selected in step 404 has work to be performed, but there is no sink available to perform the work from this source, the next source with work to be performed and having a sink available to perform the work might be selected. However, bypassing the initially selected source in this manner does not maintain fairness. Furthermore, merely selecting another source does not optimize global system efficiency, even when the work is dispatched to the most efficient sink currently available to process work from the selected source. Since sources are selected in basic scheduler process 402 based on source attributes alone, the sinks available to process work from any source selected by basic scheduler process 402 may be less efficient than other system sinks that are currently available, but that may not be able to process work from the selected source due to assignment constraints.

In accordance with an illustrative embodiment, in response to determining at step 406 that no sink currently is available for the selected source, instead of selecting another source with work to be performed, basic scheduler process 402 does not advance and stays with the selected source (step 412). Thus, step 412 ensures that the source selected in step 404 will be selected again by basic scheduler process 402 in the next scheduling period, and will continue to be selected by basic scheduler process 402 until a sink becomes available to process work from the selected source. In the case where basic scheduler process 402 does not select a source from which work can be processed in a given scheduling period, basic scheduler process 402 defers to complement scheduler process 414 for selecting a source from which work may be provided to an available sink in that scheduling period (step 416). Step 416 completes basic scheduler process 402 for the current scheduling period.

Complement scheduler process 414 begins by selecting one or more sinks based on desirable sink attributes (step 418). Step 418 may include selecting one or more sinks that are currently available to process work from any source based on sink attributes related to processing efficiency. For example, where the sinks are multiple threads operating on multiple cores, the sink attributes employed in step 418 may include static attributes, such as the size of the cache available to a thread or core, and/or dynamic attributes, such as the number of currently busy threads on a core, actual cache occupancy, etc. Since fairness is provided among sources by basic scheduler process 402, step 418 may make a sink selection based entirely on efficiency considerations related to sink attributes, and without regard to any fairness or other considerations related to source attributes.

Complement scheduler process 414 then selects a single source having work to be performed and that, in accordance with any assignment constraints, may provide that work to a sink selected in step 418 (step 420). Step 420 may include selecting a single source from among several sources that may have work that may be processed by the one or more selected sinks based on one or more source attributes. Such source attributes may include, for example, static attributes, such as tenure related to source priority or other factors, and dynamic attributes, such as the amount of time that a source has been waiting to be serviced or the amount of work that a source has to be processed. Use of such source attributes may introduce some fairness into the complement scheduler process 414, which bases its selection initially on efficiency considerations related to sink attributes alone. Use of such source attributes for source selection in step 420 does not modify the efficiency which is guaranteed at the system level by selecting the most efficient sink in step 418, but may increase fairness if the source selection is made using source attributes such as tenure, amount of work that the source has to be processed, or the like.

Multiple sinks, such as multiple sinks having the same level of efficiency, may have been selected in step 418. Thus, the source selected in step 420 may have available more than one of the sinks selected in step 418 to which work from the source may be dispatched. In this case, one sink is selected from among the sinks selected in step 418 that may, in accordance with any assignment constraints, process work from the source selected in step 420 (step 422). Step 422 may include any process for selecting a single sink from among candidate sinks, such as using a simple round robin process. Alternatively, step 422 in complement scheduler process 414 may include the same process employed in step 408 in basic scheduler process 402.

It is determined in complement scheduler process 414 whether or not basic scheduler process 402 has deferred to the selections made by complement scheduler process 414 in the current scheduling period (step 424). When it is determined in step 424 that basic scheduler process 402 has deferred to complement scheduler process 414 in the current scheduling period, work from the source selected in step 420 is dispatched to the sink selected in step 422 in step 410. As mentioned above, step 410 completes method 400 for the scheduling period. Complement scheduler process 414 is completed for the scheduling period when it is determined in step 424 that basic scheduler process 402 has not deferred to complement scheduler process 414 in the current scheduling period.
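The interplay of basic scheduler process 402 and complement scheduler process 414 can be sketched in software. This is a simplified sketch under several assumptions, none of which come from the embodiment: sink efficiency is represented by a hypothetical per-sink `sink_cost` (lower is more efficient), the basic scheduler uses plain round robin, the complement scheduler runs only when needed instead of in parallel, it picks the lowest-numbered qualifying source in step 420, and it moves on to the next most efficient sink if no source can feed the best one.

```python
class DualScheduler:
    """Sketch of method 400: a fair basic scheduler plus an efficiency-driven complement."""

    def __init__(self, assignment_mask, sink_cost):
        self.mask = assignment_mask  # per-source bit vector of allowed sinks
        self.sink_cost = sink_cost   # hypothetical metric: lower = more efficient
        self.rr = 0                  # round-robin pointer of the basic scheduler

    def _sinks_by_efficiency(self, sink_bits):
        sinks = [s for s in range(len(self.sink_cost)) if (sink_bits >> s) & 1]
        return sorted(sinks, key=lambda s: self.sink_cost[s])

    def schedule(self, source_pressure, sink_pressure):
        """Return (source, sink) for this period, or None if nothing can be dispatched."""
        n = len(self.mask)
        # Step 404: pick a source on source attributes alone (round robin).
        order = [(self.rr + i) % n for i in range(n)]
        first = next((s for s in order if (source_pressure >> s) & 1), None)
        if first is None:
            return None
        # Steps 406/408: most efficient available sink allowed for that source.
        sinks = self._sinks_by_efficiency(self.mask[first] & sink_pressure)
        if sinks:
            self.rr = (first + 1) % n  # advance only after the source is served
            return first, sinks[0]
        # Step 412: stay on the unserved source for the next period.
        self.rr = first
        # Steps 418-422: complement scheduler, sink first, then a source for it.
        for sink in self._sinks_by_efficiency(sink_pressure):
            for src in range(n):
                if (source_pressure >> src) & 1 and (self.mask[src] >> sink) & 1:
                    return src, sink
        return None


# Source 0 may only use sink 0; source 1 may only use sink 1.
sched = DualScheduler(assignment_mask=[0b01, 0b10], sink_cost=[1, 2])
first_period = sched.schedule(source_pressure=0b11, sink_pressure=0b10)
second_period = sched.schedule(source_pressure=0b11, sink_pressure=0b11)
```

In the first period source 0 is selected but its only sink is busy, so the complement scheduler dispatches source 1 to sink 1. In the second period the basic scheduler returns to source 0, which is the fairness property that step 412 is intended to preserve.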

The use of basic scheduler process 402 in combination with complement scheduler process 414 in method 400 optimizes the balance between efficiency and fairness in selecting a source from among multiple sources having work to be processed and selecting a sink from among multiple sinks that may process work from the selected source in any given processing period. In accordance with the illustrative embodiment, basic scheduler process 402 selects a source from among sources having work available in order to maintain fairness with respect to the sources, even if there is no sink currently available to process work from the selected source. Complement scheduler process 414 improves efficiency by giving priority to more efficient sinks, before selecting the source from which work is to be processed.

Dual scheduling in accordance with an illustrative embodiment may be implemented in a high performance packet dispatcher for handling 10 Gbps, 40 Gbps, or even 100 Gbps packet traffic. In this case, scheduling in accordance with an illustrative embodiment may not be implemented using finite state machines performing selections, e.g., the queue selection, the core selection, and the thread selection, one after the other, based on observations of scheduling elements performed each scheduling period. In accordance with an illustrative embodiment, the selection functionality preferably is split among separate entities, implemented using logic circuitry, in order that each selection may be completed in one selection cycle.

In accordance with an illustrative embodiment, the sources may be a plurality of data queues, the work includes data packets on the plurality of data queues, and the plurality of sinks are a plurality of processor threads adapted to process the data packets. The threads may be independent, served by different queues and even different priority planes. Thus, the scheduling selection of selecting a sink in the basic scheduler may not follow the flow of choosing the least loaded core first and then choosing a thread inside this least loaded core. In accordance with an illustrative embodiment, thread selection by the basic scheduler may be made in a first cycle without regard to total core load or weight. Later, in another cycle, the core weight may be applied to select the best core. In another cycle a thread in the best core may be selected.

In accordance with an illustrative embodiment, a mechanism may be provided to select a queue in which packets are pre-classified. This mechanism may be composed of different sets of queues arranged in different sets of priorities, comprising information on each queue status and a remembrance of a last selected queue. When a queue is selected, another mechanism selects all threads that are eligible, i.e., all threads that are not busy and that are not masked for the queue by any assignment constraints. This mechanism may include a set of mask bits per queue, with each bit defining the ability of a given thread to handle traffic from the queue, and a separate mechanism indicating whether a given thread is busy or free. Examples of these mechanisms are described above with reference to FIG. 3. The selection of threads may then be performed by logic circuitry looking at the information provided by the described mechanisms in parallel, thus making a selection decision in one propagation time of one system clock cycle through the logic circuits.

In accordance with an illustrative embodiment, a mechanism may be provided which computes a “weight” for each core based on the number of free threads in a given core. Core weight preferably is calculated without regard to the queue or priority plane that has fed the core threads. A mechanism is provided which then selects only one core from among the cores of all of the threads that have been selected by the mechanism described above, using the weight of each core and selecting the core with the highest weight. This mechanism may include logic circuitry for applying the core weight on the relevant threads and logic circuitry selecting the core with the highest weight. The selection of the core thus may be made by logic circuitry looking at all of the necessary information in parallel, and making the decision in one propagation time of one system clock cycle through the logic circuits.

In accordance with an illustrative embodiment, a mechanism for selecting a thread in the selected core may include logic circuitry applying the previous thread selection remembrance and logic circuitry for selecting the thread in round robin fashion. The selection of the thread thus may be made by logic circuitry looking at all of the necessary information in parallel and making the decision in one propagation time of one system clock cycle through the logic circuits.

In another illustrative embodiment, a mechanism may be provided such that at least one thread can reserve a core for itself, with the property that if all of the cores corresponding to the eligible threads have at least one thread reserving the core, then any core can be selected.

In a complement scheduler in accordance with an illustrative embodiment, a core is selected before a queue is selected. The selected core preferably is the least loaded core that has at least one thread that is free and that is eligible to handle traffic coming from any or all of the queues. In accordance with an illustrative embodiment, such core selection preferably is performed by logic circuitry looking at all necessary information in parallel and making the core selection decision in one propagation time of one system clock cycle through these logic circuits.

In accordance with an illustrative embodiment, each core may advertise its weight on a global weight line that is provided to all other cores. Each core checks its own weight against the weight of other cores as provided on the weight lines. If the core has the weight indicating that it is least loaded, it is the winner and is selected. Several cores with the same weight may be winners and selected.
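The weight comparison among cores can be modeled in software as follows, with the caveat that the embodiment describes weight lines and parallel logic circuitry, not a sequential loop. The sketch assumes that the weight of a core is simply its number of free threads, so that the highest weight identifies the least loaded core; the function names and the three-core layout are hypothetical.

```python
def core_weights(free_thread_bits, num_cores, threads_per_core):
    """Weight of each core = number of its threads currently free."""
    core_bits = (1 << threads_per_core) - 1
    return [
        bin((free_thread_bits >> (core * threads_per_core)) & core_bits).count("1")
        for core in range(num_cores)
    ]


def winning_cores(weights, eligible_cores):
    """Return every eligible core whose weight equals the best advertised weight."""
    if not eligible_cores:
        return []
    best = max(weights[c] for c in eligible_cores)
    return [c for c in eligible_cores if weights[c] == best]


# Three cores of four threads: core 0 has 1 free thread, cores 1 and 2 have 2 each.
weights = core_weights(0b0011_0110_0001, num_cores=3, threads_per_core=4)
winners = winning_cores(weights, eligible_cores=[0, 1, 2])
```

Cores 1 and 2 tie here, which corresponds to the case in which several cores are winners and a further selection among them is needed.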

Any queue belonging to a selected core can now be selected. This queue selection may be performed using a round robin process using remembrance of the last queue selected by the complement scheduler. This queue selection also may be performed by selecting the queue with the maximum of fairness employing circuitry which will make the selection in one cycle. Thus, the selection of the queue by the complement scheduler may be accomplished by logic circuitry looking at all necessary information in parallel and making the decision in one propagation time of the system clock through the logic circuits.

After the queue is selected, a thread may be selected using the mechanisms described above with respect to thread selection by the basic scheduler. In this case, the same logic circuitry that is used for core selection by the basic scheduler may be used to select from among several selected cores having the same weight.

Thus, in accordance with an illustrative embodiment, basic and complement scheduling may be implemented using logic circuitry in a manner that provides for dual scheduling for very high speed packet throughput. Basic and complement circuitry in accordance with an illustrative embodiment may run in parallel to always provide a queue and thread selection decision in the same scheduler running time period. This period cannot be longer than 6 ns in order to be able to make a decision on each packet received, for example, on a 100 Gbps Ethernet line. This time constraint calls for the use of logic circuitry as described, for the selection of queues, cores, and threads, working in parallel, with one working cycle for each selection decision, leaving only a very few cycles, e.g., 6 cycles total with a 1 GHz technology, where the transmission time of the smallest 100 Gbps Ethernet packet is approximately 6 ns.
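The 6 ns and 6 cycle figures can be checked with a short calculation. The framing numbers below are an assumption on our part: they are the standard Ethernet minimum frame of 64 bytes plus 8 bytes of preamble and start-of-frame delimiter and a 12 byte interframe gap, none of which is stated in the text above. Integer picosecond arithmetic is used to avoid rounding noise.

```python
# Assumed standard Ethernet framing overhead (not taken from the text above).
MIN_FRAME_BYTES = 64        # smallest Ethernet frame
PREAMBLE_SFD_BYTES = 8      # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # mandatory idle time between frames

LINE_RATE_BPS = 100_000_000_000  # 100 Gbps
CLOCK_HZ = 1_000_000_000         # 1 GHz technology
PS_PER_SECOND = 1_000_000_000_000

# Bits that occupy the line for each minimum-size packet.
bits_on_wire = (MIN_FRAME_BYTES + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8

# Time between back-to-back minimum-size packets, in picoseconds.
packet_time_ps = bits_on_wire * PS_PER_SECOND // LINE_RATE_BPS

# Whole clock cycles available to reach a scheduling decision per packet.
clock_period_ps = PS_PER_SECOND // CLOCK_HZ
cycles_available = packet_time_ps // clock_period_ps
```

Under these assumptions a new minimum-size packet can arrive every 6.72 ns, which leaves 6 whole cycles at 1 GHz and is consistent with the figures given above.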

The flowcharts and block diagrams in the different depicted embodiments illustrate the architecture, functionality, and operation of some possible implementations of apparatus and methods in different advantageous embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, function, and/or a portion of an operation or step. In some alternative implementations, the function or functions noted in the block may occur out of the order noted in the figures. For example, in some cases, two blocks shown in succession may be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. Also, other blocks may be added in addition to the illustrated blocks in a flowchart or block diagram.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and explanation, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The illustrative embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

1. A method of assigning work from a plurality of sources to a plurality of sinks in a scheduling period, comprising: selecting a first source from the plurality of sources based on a source attribute; determining whether a sink is available for the first source based on assignment constraints between the plurality of sources and the plurality of sinks; when the sink is available for the first source, selecting a first sink for the first source based on the assignment constraints; and when the sink is not available for the first source, staying on the first source such that the same first source is selected as the first source in a next scheduling period, selecting a second sink from the plurality of sinks based on a sink attribute, and selecting a second source for the second sink based on the assignment constraints.
2. The method of claim 1, wherein selecting the first source includes selecting the first source from the plurality of sources using a round robin selection process.
3. The method of claim 1, wherein the source attribute includes at least one source attribute selected to maintain fairness in selecting a first source from the plurality of sources.
4. The method of claim 3, wherein the source attribute includes at least one source attribute selected from the group of source attributes consisting of tenure, an amount of time since a source was serviced by a sink, and an amount of work that the source has to be processed.
5. The method of claim 1, wherein selecting a first sink includes selecting a first sink for the first source based on the source attribute related to sink efficiency.
6. The method of claim 1, wherein the sink attribute includes at least one attribute related to sink efficiency.
7. The method of claim 6, wherein the sink attribute is selected from the group of sink attributes consisting of cache size, cache occupancy, and a number of busy threads on a processor core.
8. The method of claim 1, wherein selecting a second sink includes selecting the second sink based on the sink attribute related to sink efficiency.
9. The method of claim 1 further comprising: when the sink is available for the first source, dispatching work from the first source to the first sink; and when the sink is not available for the first source, dispatching work from the second source to the second sink.
10. The method of claim 1, wherein: the plurality of sources are a plurality of data queues; the work includes data packets on the plurality of data queues; and the plurality of sinks are a plurality of processor threads adapted to process the data packets.
11. An apparatus for assigning work from a plurality of sources to a plurality of sinks in a scheduling period, comprising: a basic scheduler adapted to select a first source from the plurality of sources based on a source attribute, to determine whether a sink is available for the first source based on assignment constraints between the plurality of sources and the plurality of sinks, to select a first sink for the first source based on the assignment constraints when a sink is available for the first source, and to stay on the first source such that the same first source is selected as the first source by the basic scheduler in a next scheduling period and to defer to a complement scheduler when a sink is not available for the first source; and a complement scheduler adapted to select a second sink from the plurality of sinks based on a sink attribute and to select a second source for the second sink based on the assignment constraints.
12. The apparatus of claim 11, wherein the basic scheduler includes a round robin scheduler for selecting the first source from the plurality of sources.
13. The apparatus of claim 11, wherein the source attribute includes at least one source attribute selected to maintain fairness in selecting a first source from the plurality of sources.
14. The apparatus of claim 13, wherein the source attribute includes at least one source attribute selected from the group of source attributes consisting of tenure, an amount of time since a source was serviced by a sink, and an amount of work that a source has to be processed.
15. The apparatus of claim 11, wherein selecting a first sink includes selecting the first sink for the first source based on a source attribute related to sink efficiency.
16. The apparatus of claim 11, wherein the sink attribute includes at least one attribute related to sink efficiency.
17. The apparatus of claim 16, wherein the sink attribute is selected from the group of sink attributes consisting of cache size, cache occupancy, and a number of busy threads on a processor core.
18. The apparatus of claim 11, wherein selecting a second sink includes selecting the second sink based on the sink attribute related to sink efficiency.
19. The apparatus of claim 11, wherein: the plurality of sources are a plurality of data queues; the work includes data packets on the plurality of data queues; and the plurality of sinks are a plurality of processor threads adapted to process the data packets.
20. The apparatus of claim 19 further comprising a packet injector adapted to dispatch work from the first source to the first sink when the basic scheduler does not defer to the complement scheduler and to dispatch work from the second source to the second sink when the basic scheduler does defer to the complement scheduler.